In this notebook, we look at the peptides that carry an oxidation tag in the raw data set and show that including or excluding them changes the merged data set.



In [1]:

    
from msdas import *
%pylab inline









    



Couldn't import dot_parser, loading of dot files will not be possible.
Populating the interactive namespace from numpy and matplotlib



In [2]:

    
r = MassSpecReader(get_yeast_raw_data())









    



INFO:root:Reading /home/cokelaer/Work/github/msdas/share/data/Yeast_all_raw.csv
INFO:root:Renaming psites with ^ character
INFO:root:Replacing zeros with NAs
INFO:root:-- 200 rows have ambiguous psites and are removed
INFO:root:save data in attribute _ambiguous_psites_df
INFO:root:--------------------------------------------------
INFO:root:-- Removing 125 rows with ambigous protein names:
INFO:root:--------------------------------------------------
WARNING:root:Rebuilding identifier in the dataframe. MERGED prefixes will be lost
WARNING:root:Identifiers are not unique. Have you called merge_peptides() ?



In [3]:

    
r.df.shape









    Out[3]:





(8570, 115)



In [4]:

    
r.plot_phospho_stats()



In [5]:

    
# Which rows contain "Oxidation" in their sequence?
df = r.df[r.df.Sequence_Phospho.apply(lambda x: "Oxidation" in x)]
# We can build a new MassSpecReader instance from this dataframe:
oxidation = MassSpecReader(df)









    



INFO:root:Renaming psites with ^ character
INFO:root:Replacing zeros with NAs
INFO:root:-- Removing 0 rows with ambigous protein names:
INFO:root:--------------------------------------------------
WARNING:root:Rebuilding identifier in the dataframe. MERGED prefixes will be lost
WARNING:root:Identifiers are not unique. Have you called merge_peptides() ?



In [6]:

    
oxidation.df.shape









    Out[6]:





(415, 115)
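
The selection above relies only on the text of the Sequence_Phospho column. As a minimal sketch of the same idea without msdas, the toy dataframe below is filtered with pandas' vectorised `str.contains` instead of `apply`. The dataframe `toy` and the third sequence are made up for illustration; they are not part of the yeast data.

```python
import pandas as pd

# Toy stand-in for r.df; only the columns needed by the filter are included.
toy = pd.DataFrame({
    "Protein": ["STE12", "STE12", "HOG1"],
    "Sequence_Phospho": [
        "LVS(Phospho)PSDPTSYM(Oxidation)K",
        "LVS(Phospho)PSDPTSYMK",
        "AAAS(Phospho)K",  # made-up sequence, for illustration only
    ],
})

# regex=False matches the literal substring "Oxidation"
has_oxidation = toy["Sequence_Phospho"].str.contains("Oxidation", regex=False)
oxidised = toy[has_oxidation]
print(oxidised.shape)
```

The boolean mask can also be negated (`~has_oxidation`) to keep only the rows without an oxidation tag.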



In [7]:

    
oxidation.plot_phospho_stats()
# The oxidation subset looks representative of the full data set (see figures above)



In [8]:

    
# Similarly for the number of NAs
clf()
r.get_na_count().hist(normed=True, alpha=0.5)
oxidation.get_na_count().hist(normed=True, alpha=0.5, color="green")
# The two normalised histograms compare the NA counts of the full data set and of the oxidation subset (green)









    Out[8]:





<matplotlib.axes._subplots.AxesSubplot at 0x8b3c350>
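
For readers without msdas at hand, the quantity compared in this histogram can be approximated with plain pandas. The sketch below assumes that `get_na_count()` returns the number of missing values per row, which this notebook does not confirm; the measurements are made up.

```python
import numpy as np
import pandas as pd

# Made-up measurements: 4 peptides x 3 time points
toy = pd.DataFrame({
    "t0": [0.1, np.nan, 0.3, np.nan],
    "t1": [0.2, np.nan, np.nan, 0.4],
    "t5": [0.3, 0.5, np.nan, np.nan],
})

# Number of missing values in each row (assumed meaning of the NA count)
na_count = toy.isnull().sum(axis=1)

# Fraction of rows for each NA count; normalising makes data sets
# of different sizes comparable, like normed=True in the histogram
na_fraction = na_count.value_counts(normalize=True).sort_index()
print(na_fraction)
```

Normalising matters here because the oxidation subset (415 rows) is much smaller than the full data set (8570 rows).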

Let us figure out if some proteins in the small data set have peptides with oxidation:



In [9]:

    
y = MassSpecReader(get_yeast_small_data())









    



INFO:root:Reading /home/cokelaer/Work/github/msdas/share/data/YEAST_small_all.csv
INFO:root:Renaming psites with ^ character
INFO:root:Replacing zeros with NAs
INFO:root:-- Removing 0 rows with ambigous protein names:
INFO:root:--------------------------------------------------
WARNING:root:Rebuilding identifier in the dataframe. MERGED prefixes will be lost



In [10]:

    
proteins = list(set(y.df.Protein))



In [11]:

    
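# Keep the oxidation rows whose protein also appears in the small data set
filter_proteins = oxidation.df.Protein.apply(lambda x: x in proteins)
subdf = oxidation.df[filter_proteins].Protein
# Unique protein names found in both data sets
found = list(set(subdf.values))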



In [12]:

    
found









    Out[12]:





['STE20', 'RCK2', 'HOG1', 'STE12', 'FUS3']

Effect of oxidation on the merging from the raw data to the small data set.

We will look at the STE12 case found in the list above.



In [13]:

    
Y = replicates.ReplicatesYeast(get_yeast_raw_data(), verbose=True, cleanup=True)
Y.normalise()









    



INFO:root:Reading /home/cokelaer/Work/github/msdas/share/data/Yeast_all_raw.csv
INFO:root:Renaming psites with ^ character
INFO:root:Replacing zeros with NAs
INFO:root:-- 200 rows have ambiguous psites and are removed
INFO:root:save data in attribute _ambiguous_psites_df
INFO:root:--------------------------------------------------
INFO:root:-- Removing 125 rows with ambigous protein names:
INFO:root:--------------------------------------------------
WARNING:root:Rebuilding identifier in the dataframe. MERGED prefixes will be lost
WARNING:root:Identifiers are not unique. Have you called merge_peptides() ?



In [14]:

    
clf()
# Small data set
res1 = y.plot_timeseries("STE12_S400")
# Normalised raw data, plotted in green
res2 = Y.plot_timeseries("STE12_S400", color="g", markersize=5)









    



WARNING:root:More than 1 row found. Consider calling merging_ambiguous_peptides method

Here, the data from the small data set are shown in red.
In green are the two rows of the raw data that correspond to peptide STE12_S400: one with the oxidation tag and one without. Both rows are the same peptide (see next cell).
One of the green curves matches the small data set exactly, which shows that peptides with oxidation are removed, as confirmed by looking at the data set.



In [15]:

    
Y['STE12_S400']









    Out[15]:






  
    
      
      Protein  Sequence      Psite  Sequence_Phospho                  a0_t0     a0_t0.1   a0_t0.2   a0_t1     a0_t1.1   a0_t1.2   ...  a45_t10.2  a45_t20   a45_t20.1  a45_t20.2  a45_t45   a45_t45.1  a45_t45.2  Entry   Entry_name   Identifier
2881  STE12    LVSPSDPTSYMK  S400   LVS(Phospho)PSDPTSYM(Oxidation)K  NaN       NaN       NaN       NaN       NaN       0.000077  ...  NaN        NaN       NaN        NaN        NaN       NaN        NaN        P13574  STE12_YEAST  STE12_S400
2882  STE12    LVSPSDPTSYMK  S400   LVS(Phospho)PSDPTSYMK             0.000254  0.000207  0.000242  0.000186  0.000215  0.000179  ...  0.000092   0.000118  0.000125   NaN        NaN       0.000197   0.000109   P13574  STE12_YEAST  STE12_S400
    
  

2 rows × 115 columns
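
To make the effect on merging concrete, here is a small sketch using two of the columns of the STE12_S400 rows (a0_t0 and a0_t1.2). How msdas combines rows that share an identifier is not shown in this notebook; summing is used here purely as an assumption, to illustrate that the merged values depend on whether the oxidised row is kept.

```python
import numpy as np
import pandas as pd

# Two columns of the STE12_S400 rows, values copied from the notebook output
rows = pd.DataFrame({
    "Identifier": ["STE12_S400", "STE12_S400"],
    "Sequence_Phospho": [
        "LVS(Phospho)PSDPTSYM(Oxidation)K",
        "LVS(Phospho)PSDPTSYMK",
    ],
    "a0_t0": [np.nan, 0.000254],
    "a0_t1.2": [0.000077, 0.000179],
})
values = ["a0_t0", "a0_t1.2"]

# Option 1: keep the oxidised row and sum rows sharing an identifier
# (summing is an assumed merging rule, not necessarily what msdas does)
with_ox = rows.groupby("Identifier")[values].sum()

# Option 2: drop oxidised rows first, then merge the same way
is_ox = rows["Sequence_Phospho"].str.contains("Oxidation", regex=False)
without_ox = rows[~is_ox].groupby("Identifier")[values].sum()

print(with_ox)
print(without_ox)
```

Only the column in which the oxidised peptide was measured (a0_t1.2) differs between the two results; a0_t0 is unchanged because the oxidised row is NaN there.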


